Regional Analysis of Preterm Birth Characteristics
BMIN503/EPID600 Final Project
Author
Rose Albert
1 Interrogating sex as a biological variable in county-level preterm birth inequity
December 13, 2024
By Rose Albert
1.1 Overview
Preterm birth affects approximately 10% of all births, and there is a well-established male vulnerability and female resilience during the neonatal period. The goal of this project is to interrogate sex as a biological variable at the county-level using CDC Wonder Natality Data and identify sex-differences in preterm birth rates and NICU admission. I have spoken with Dr. Aimin Chen and Dr. Heather Burris about biological and structural determinants contributing to preterm birth inequity, as well as potential areas of clinical, public health, and policy intervention.
Preterm birth can predispose infants to immediate risks in the neonatal period as well as long-term adverse sequelae. For example, preterm infants are born with underdeveloped lungs and often require therapeutic interventions such as surfactant administration, antibiotics, steroids, mechanical ventilation, and supplemental oxygen for survival. However, these therapeutics can also pose postnatal insults and contribute to lung injury and diseases such as bronchopulmonary dysplasia.
Interrogation of sex as a biological variable has received increasing attention as a critical area for understanding basic pathophysiology and differential response to therapeutics. Male sex is an independent risk factor for numerous pediatric diseases and diseases of prematurity. This project will investigate male susceptibility to preterm birth and NICU admission at the county-level. An understanding of preterm birth characteristics can inform local public health interventions.
1.3 Methods
Data curation and cleaning
County-level Birth Characteristics Data Source: I am using the CDC Wonder Natality Data (2023) grouped by “County of Residence,” “Sex of Infant,” “NICU Admission,” and “OE Gestational Age Recode 10,” with the measures “Births” and “Percent of Total Births.”
Mapping Data Source: I am drawing from Assignment 5 to use geospatial data, tidy census, and leaflet to generate interactive maps.update.packages(“tidycensus”)
#load necessary librarieslibrary(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(sf)
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(tigris)
To enable caching of data, set `options(tigris_use_cache = TRUE)`
in your R script or .Rprofile.
library(tidycensus)library(leaflet)options(tigris_use_cache =TRUE)options(progress_enabled =FALSE)#Load county-level birth data downloaded from CDC Wonderbirth_data <-read.delim("https://raw.githubusercontent.com/rosemalbert/BMIN503_Final_Project/refs/heads/master/Natality%2C%202016-2023%20expanded_sex.txt")county_data <-read.delim("https://raw.githubusercontent.com/rosemalbert/BMIN503_Final_Project/refs/heads/master/Natality%2C%202016-2023%20expanded%20_county.txt")#view column namescolnames(birth_data)
#merge birth_data and county_datamerged_data <-left_join(birth_data, county_data, by ="County.of.Residence.Code")
Warning in left_join(birth_data, county_data, by = "County.of.Residence.Code"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 11803 of `x` matches multiple rows in `y`.
ℹ Row 250 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
#clean merged_data library(dplyr)# clean merged data by removing unnecessary columns, checking discrepancies, and renamingcleaned_data <- merged_data %>%# Remove 'Notes.x' columnselect(-Notes.x) %>%# Check for discrepancies between 'County.of.Residence.x' and 'County.of.Residence.y'# mutate(County.of.Residence =ifelse( County.of.Residence.x == County.of.Residence.y, County.of.Residence.x, NA ) ) %>%# Merge 'County.of.Residence.x' and 'County.of.Residence.y'select(-County.of.Residence.x, -County.of.Residence.y) %>%# Rename 'Births.y' to 'Total.Births'rename(Total.Births = Births.y ) %>%# Remove 'Notes.y' columnselect(-Notes.y) %>%# Remove sex of infant codeselect(-Sex.of.Infant.Code) %>%# Remove 'X..of.Total.Births.y' and 'X..of.Total.Births.x'select(-c(X..of.Total.Births.x, X..of.Total.Births.y)) %>%# Remove 'OE.Gestational.Age.Recode.10.Code'select(-OE.Gestational.Age.Recode.10.Code) %>%# Remove 'NICU.Admission.Code'select(-NICU.Admission.Code) %>%# Filter out rows where 'NICU.Admission' or 'OE.Gestational.Age.Recode.10' are "Unknown"filter(!NICU.Admission %in%c("Unknown or Not Stated", ""),!OE.Gestational.Age.Recode.10%in%c("Unknown or Not Stated", "") ) %>%# Remove any duplicatesdistinct()cleaned_data <- cleaned_data %>%mutate(County.of.Residence.Code =as.character(County.of.Residence.Code))# View cleaned datahead(cleaned_data)
County.of.Residence.Code Sex.of.Infant NICU.Admission
1 24033 Female No
2 24033 Male No
3 24037 Male Yes
4 24999 Female No
5 24999 Male Yes
6 25003 Female No
OE.Gestational.Age.Recode.10 Births.x Total.Births Total.Population
1 28 - 31 weeks 10 10527 947430
2 42 weeks or more 10 10527 947430
3 32 - 35 weeks 10 1309 115281
4 28 - 31 weeks 10 5546 550411
5 40 weeks 10 5546 550411
6 36 weeks 10 890 126818
Birth.Rate County.of.Residence
1 11.11 Prince George's County, MD
2 11.11 Prince George's County, MD
3 11.35 St. Mary's County, MD
4 10.08 Unidentified Counties, MD
5 10.08 Unidentified Counties, MD
6 7.02 Berkshire County, MA
# Convert 'County.of.Residence.Code' to character before mergingpreterm_rate_data <- cleaned_data %>%mutate(Is.Preterm = OE.Gestational.Age.Recode.10%in%c("28 - 31 weeks", "32 - 35 weeks", "36 weeks", "20 - 27 weeks", "Under 20 weeks"),County.of.Residence.Code =as.character(County.of.Residence.Code) # Convert to character ) %>%group_by(County.of.Residence, County.of.Residence.Code) %>%# Keep 'County.of.Residence.Code' in the group_bysummarise(Total.Births =sum(Total.Births, na.rm =TRUE),Preterm.Births =sum(ifelse(Is.Preterm, Births.x, 0), na.rm =TRUE),Preterm.Birth.Rate = (Preterm.Births / Total.Births) *100 ) %>%ungroup()
`summarise()` has grouped output by 'County.of.Residence'. You can override
using the `.groups` argument.
# View the result to ensure 'County.of.Residence.Code' is retainedhead(preterm_rate_data)
# A tibble: 6 × 5
County.of.Residence County.of.Residence.Code Total.Births Preterm.Births
<chr> <chr> <int> <dbl>
1 Ada County, ID 16001 122064 408
2 Adams County, CO 8001 139436 566
3 Adams County, PA 42001 9130 26
4 Aiken County, SC 45003 26698 169
5 Alachua County, FL 12001 48317 266
6 Alamance County, NC 37001 28320 158
# ℹ 1 more variable: Preterm.Birth.Rate <dbl>
#load census shapefile data and merge by county FIPs counties_sf <-counties(cb =TRUE, resolution ="20m") # cb = TRUE for a simplified geometry
Retrieving data for the year 2022
# Merge the cleaned data with the counties shapefilecounty_map_preterm_data <-left_join(counties_sf, preterm_rate_data, by =c("GEOID"="County.of.Residence.Code"))county_map_cleaned_data <-left_join(counties_sf, cleaned_data, by =c("GEOID"="County.of.Residence.Code"))# Integrate with US Census Data
Analysis plan
Exploratory analysis of data: Generate county-level choropleth maps of birth rates ((births/total population)*1000 for each county). Generate county-level choropleth maps of preterm birth rates (<37 gestational age births/total population). Generate sex-stratified maps of preterm birth rates. Generate sex-stratified choropleth maps of NICU admissions by county. Generate bar graphs showing NICU admission rates by sex and sex distribution of gestational age groups.
Data analysis: Conduct 2-way ANOVA to determine if sex as a biological variable is a significant predictor of NICU admission. Identify counties with highest and lowest preterm birth rates, and the highest and lowest NICU admission rates.
1.4 Results
First, I developed a county-level choropleth maps of birth rates (live births per 1000 people) for counties that did not have any missing data. The below map represents 496 counties. Rockland County, NY has the highest birth rate (19.4 per 1000 people) while Charlotte County, FL has the lowest birth rate (4.99 per 1000 people).
The following object is masked from 'package:lubridate':
stamp
library(nhanesA)library(haven)library(viridis)
Loading required package: viridisLite
# Convert Birth.Rate to numeric (if necessary)county_map_cleaned_data$Birth.Rate <-as.numeric(county_map_cleaned_data$Birth.Rate)# Recheck the structure of the columnstr(county_map_cleaned_data$Birth.Rate)
num [1:11636] NA NA NA NA NA ...
# Remove rows where Birth.Rate is NAcounty_map_cleaned_data <- county_map_cleaned_data %>%filter(!is.na(Birth.Rate))#load map thememap_theme <-function() {theme_minimal() +theme(axis.line =element_blank(), axis.text =element_blank(), axis.title =element_blank(),panel.grid =element_line(color ="white"), legend.key.size =unit(0.8, "cm"), legend.text =element_text(size =16), legend.title =element_text(size =16) )}#create my palettemyPalette <-colorRampPalette(viridis(100))# Generate the choropleth map for birth ratesbirth_rates_map <-ggplot(data = county_map_cleaned_data) +geom_sf(aes(fill = Birth.Rate)) +# Map Preterm Birth Rate to the fillmap_theme() +# Apply the custom themeggtitle("2023 Birth Rate (live births per 1000 people)") +scale_fill_gradientn(name ="Birth Rate\nrate", colours =myPalette(100))coord_sf(xlim =c(-79.5, -75), ylim =c(37, 39))
<ggproto object: Class CoordSf, CoordCartesian, Coord, gg>
aspect: function
backtransform_range: function
clip: on
crs: NULL
datum: crs
default: FALSE
default_crs: NULL
determine_crs: function
distance: function
expand: TRUE
fixup_graticule_labels: function
get_default_crs: function
is_free: function
is_linear: function
label_axes: list
label_graticule:
labels: function
limits: list
lims_method: cross
modify_scales: function
ndiscr: 100
params: list
range: function
record_bbox: function
render_axis_h: function
render_axis_v: function
render_bg: function
render_fg: function
setup_data: function
setup_layout: function
setup_panel_guides: function
setup_panel_params: function
setup_params: function
train_panel_guides: function
transform: function
super: <ggproto object: Class CoordSf, CoordCartesian, Coord, gg>
# Print the mapprint(birth_rates_map)
#How many counties are represented in this analysis?num_unique_geoid_counties <-length(unique(county_map_cleaned_data$GEOID))print(num_unique_geoid_counties)
[1] 496
#Which counties have the highest and lowest birth rates?# Find the county with the highest Birth Rate and its value (show only the first entry)highest_birth_rate <- county_map_cleaned_data %>%filter(Birth.Rate ==max(Birth.Rate, na.rm =TRUE)) %>%slice(1) # Select the first rowcat("County with the highest Birth Rate: ", highest_birth_rate$County.of.Residence, "\n")
County with the highest Birth Rate: Rockland County, NY
# Find the county with the lowest Birth Rate and its value (show only the first entry)lowest_birth_rate <- county_map_cleaned_data %>%filter(Birth.Rate ==min(Birth.Rate, na.rm =TRUE)) %>%slice(1) # Select the first rowcat("County with the lowest Birth Rate: ", lowest_birth_rate$County.of.Residence, "\n")
County with the lowest Birth Rate: Charlotte County, FL
To better visualize and explore the birth rates, I then generated an interactive choropleth map of birth rates by county.
#generate interactive choropleth map library(leaflet)library(RColorBrewer)# Define the color scale for the Birth Rate using colorBinpal_fun <-colorBin(palette =brewer.pal(9, "YlOrRd")[c(1:5, 7)], bins =c(0, 5, 10, 15, 20, 25), reverse =FALSE)# Create the popup message for each countypu_message <-paste0(county_map_cleaned_data$County.of.Residence, "<br>Birth Rate: ", round(county_map_cleaned_data$Birth.Rate, 1), " live births per 1000 people")# Create the leaflet map with the polygonsleaflet(county_map_cleaned_data) |>addPolygons(stroke =FALSE, fillColor =~pal_fun(Birth.Rate),fillOpacity =0.7, smoothFactor =0.5,popup = pu_message) |>addProviderTiles(providers$CartoDB.Positron) |>addLegend("bottomright", pal = pal_fun, values =~Birth.Rate, title ='Birth Rate', opacity =1) |>addScaleBar()
Warning: sf layer has inconsistent datum (+proj=longlat +datum=NAD83 +no_defs).
Need '+proj=longlat +datum=WGS84'
I repeated the analysis to generate county-level choropleth maps of preterm birth rates (<37 gestational age births/total number of live births).
# Convert Preterm.Birth.Rate to numericcounty_map_preterm_data$Preterm.Birth.Rate <-as.numeric(county_map_preterm_data$Preterm.Birth.Rate)# Recheck the structure of the columnstr(county_map_preterm_data$Preterm.Birth.Rate)
num [1:3222] NA NA NA NA NA ...
# Remove rows where Birth.Rate is NAcounty_map_preterm_data <- county_map_preterm_data %>%filter(!is.na(Preterm.Birth.Rate))# Generate the choropleth map for preterm birth ratespretetrm_birth_rates_map <-ggplot(data = county_map_preterm_data) +geom_sf(aes(fill = Preterm.Birth.Rate)) +# Map Preterm Birth Rate to the fillmap_theme() +# Apply the custom themeggtitle("2023 Preterm Birth Rate % (<37 gestational age births/total number of live births)") +scale_fill_gradientn(name ="Preterm Birth Rate\nrate", colours =myPalette(100))coord_sf(xlim =c(-79.5, -75), ylim =c(37, 39))
<ggproto object: Class CoordSf, CoordCartesian, Coord, gg>
aspect: function
backtransform_range: function
clip: on
crs: NULL
datum: crs
default: FALSE
default_crs: NULL
determine_crs: function
distance: function
expand: TRUE
fixup_graticule_labels: function
get_default_crs: function
is_free: function
is_linear: function
label_axes: list
label_graticule:
labels: function
limits: list
lims_method: cross
modify_scales: function
ndiscr: 100
params: list
range: function
record_bbox: function
render_axis_h: function
render_axis_v: function
render_bg: function
render_fg: function
setup_data: function
setup_layout: function
setup_panel_guides: function
setup_panel_params: function
setup_params: function
train_panel_guides: function
transform: function
super: <ggproto object: Class CoordSf, CoordCartesian, Coord, gg>
pretetrm_birth_rates_map
#How many counties are represented in this analysis?num_unique_geoid_counties_preterm <-length(unique(county_map_preterm_data$GEOID))print(num_unique_geoid_counties_preterm)
[1] 496
#Which counties have the highest and lowest preterm birth rates?# Find the county with the highest preterm Birth Rate and its value (show only the first entry)highest_preterm_birth_rate <- county_map_preterm_data %>%filter(Preterm.Birth.Rate ==max(Preterm.Birth.Rate, na.rm =TRUE)) %>%slice(1) # Select the first rowcat("County with the highest preterm Birth Rate: ", highest_preterm_birth_rate$County.of.Residence, "\n")
County with the highest preterm Birth Rate: Kanawha County, WV
# Find the county with the lowest preterm Birth Rate and its value (show only the first entry)lowest_preterm_birth_rate <- county_map_preterm_data %>%filter(Preterm.Birth.Rate ==min(Preterm.Birth.Rate, na.rm =TRUE)) %>%slice(1) # Select the first rowcat("County with the lowest preterm Birth Rate: ", lowest_preterm_birth_rate$County.of.Residence, "\n")
County with the lowest preterm Birth Rate: Ocean County, NJ
Describe your results and include relevant tables, plots, and code/comments used to obtain them. You may refer to the Section 1.3 as needed. End with a brief conclusion of your findings related to the question you set out to address. You can include references if you’d like, but this is not required.
1.5 Conclusion
This the conclusion. The Section 1.4 can be invoked here.